This report was prepared for ETC5521 Assignment 1 by Team numbat, comprising Aarathy Babu, Lachlan Moody, Dilinie Seimon, and Jinhao Luo.

1 Introduction and motivation

2020 was a bad year for passwords. A recent audit of the ‘dark web’, reported on by Forbes, revealed that over 15 billion stolen logins were circulating online (Winder, 2020). As the article notes, for perspective, this represents two sets of account logins for every person on the planet.

This was the result of more than 100,000 data breaches relating to cybercrime activities, a 300% increase since 2018. So, in an age where everybody is leaving an ever-growing digital record of their activities, from social media to banking, what can the average person do to bolster their security online?

The following analysis explores this issue in depth using a compilation of some of the most commonly used passwords on the web. It should be noted, however, that the original data was compiled in September 2014, so the trends and findings discussed below may not be entirely applicable today; a more up-to-date collection would be required to ensure full relevance. Nevertheless, it is reasonable to assume that the underlying foundations of password security have not changed substantially in the past few years. Additionally, the strength rating provided is calculated relative to the other passwords in the data set. As laid out in the provided documentation, because these common passwords are mostly all ‘bad’, a high strength rating does not necessarily indicate that a password is hard to crack. However, the data set includes additional variables, such as crack times, that allow this to be assessed. Detailed information on the data used and the research questions formulated is provided in the following section.

2 Data description

Based upon the motivations discussed above, the following research questions were formulated. The primary question of interest was:

What are the characteristics of the most common passwords in the interest of security?

Once this area of exploration was established, five questions were composed to frame the analysis that follows. They were:

  1. What are the common trends among the most commonly used passwords?
  2. How strong are the common passwords?
  3. Is high strength related to longer password crack time?
  4. Is there a relationship between the online and offline password crack times?
  5. How are the types of characters associated with the strength of password?

A further analysis of the relationship of the online and offline cracking times of each password will be done in order to understand the underlying factors that might be impacting them.

The relationship between the strength of a password and the types of characters it includes, such as numbers, special characters, uppercase and lowercase letters, or a combination of these, will also be analysed, which might give a clearer understanding of the reasoning behind the strength of a password.

In order to address these areas and explore the field in greater depth, data was sourced from Information is Beautiful (2014). The raw file contained 507 rows of password information derived from the online databases Skullsecurity and DigiNinja, collected in 2014. The data was provided in a tidy format and was read into RStudio as a csv file directly from the GitHub repository provided by Tidy Tuesday (2020) using the readr (2018) package. Table 2.1 describes each variable included in the dataset.

Table 2.1: Data dictionary of passwords dataset

Variable            Description
rank                popularity of password
password            actual text of the password
category            password type category
value               time to crack password by online guessing
time_unit           unit of time for corresponding value
offline_crack_sec   time to crack offline in seconds
rank_alt            alternative value for rank (same value as rank in all cases)
strength            relative strength of password from 1 to 10
font_size           used externally to create graphic for Knowledge is Beautiful (2014)

A visualisation of the data structure, produced using the visdat (2017) package, can be seen below in Figure 2.1.

Figure 2.1: Initial Data Structure

Figure 2.1 highlighted two areas where the data needed to be altered. Firstly, the variable category was recoded from a character to a factor, as it was determined to be a categorical variable. Secondly, there appeared to be some missing observations in the dataset. This was examined further in Figure 2.2, produced using the naniar (2020) package, which showed that all of these missing values sat at the tail end of the data set.

Figure 2.2: Missing Data Values

On further investigation, there appeared to be 7 blank rows at the end of the dataset. These observations were subsequently removed using dplyr (2020), as they provided no tangible value and may have negatively impacted the subsequent analysis. The final resulting data frame had 500 observations of 9 variables.
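
The removal of blank rows described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the dplyr code used for the report. The inline CSV text, its made-up rows and the helper name `drop_blank_rows` are all hypothetical stand-ins for the real file.

```python
import csv
import io

# Made-up stand-in for the raw file: two populated rows followed by
# two fully blank rows, mimicking the blank tail described above.
RAW_CSV = """rank,password,category,strength
1,example-one,category-a,8
2,example-two,category-b,4
,,,
,,,
"""


def drop_blank_rows(rows):
    """Keep only rows in which at least one field is non-empty."""
    return [
        row for row in rows
        if any((value or "").strip() for value in row.values())
    ]


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = drop_blank_rows(rows)
print(len(rows), "rows read,", len(cleaned), "rows kept")
```

Filtering on "every field empty" rather than on a fixed row count means the same step would still work if the number of trailing blank rows changed.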

3 Analysis and findings

3.2 How strong are the common passwords?

The strength of these common passwords is an interesting feature to explore because the variable strength is relative to the other passwords in the dataset. Since these are commonly used passwords, they are expected to be weak and easy to crack. The following analysis explores how strong the passwords are. Throughout the analysis, the variable offline_crack_sec (the time taken to crack the password by offline guessing) is used instead of the variable value (the time taken to crack the password by online guessing), as the two are proportional to each other and comparisons between passwords give the same results with either.
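
One way to check the proportionality claim above is to convert each online crack time to seconds and compare it with the offline time. The sketch below is in Python and is not part of the original analysis. The unit names, the month and year lengths, and the three records are assumptions made up for illustration, with the records constructed to be exactly proportional.

```python
# Assumed unit lengths in seconds; the month and year lengths are
# conventions chosen for this sketch, not taken from the dataset.
SECONDS_PER_UNIT = {
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 7 * 86400,
    "months": 30 * 86400,
    "years": 365 * 86400,
}


def online_seconds(value, time_unit):
    """Convert an online crack time given as (value, unit) to seconds."""
    return value * SECONDS_PER_UNIT[time_unit]


def online_to_offline_ratios(records):
    """Ratio of online to offline crack time for each record."""
    return [
        online_seconds(r["value"], r["time_unit"]) / r["offline_crack_sec"]
        for r in records
    ]


# Made-up records, built so that online time = 1000 x offline time.
records = [
    {"value": 2.0, "time_unit": "minutes", "offline_crack_sec": 0.12},
    {"value": 1.0, "time_unit": "hours", "offline_crack_sec": 3.6},
    {"value": 1.0, "time_unit": "days", "offline_crack_sec": 86.4},
]
ratios = online_to_offline_ratios(records)
# A spread close to 1 indicates the two measures are proportional.
print(max(ratios) / min(ratios))
```

On real data, any rounding in the reported online values would be expected to make the ratios only approximately equal.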

Figure 3.5: 43.6% of the passwords are relatively high in strength

In Figure 3.5 above, it can be seen that about 43.6% of the commonly used passwords have a relative strength between 8 and 10 on a scale of 1-10, with 10 being the highest quality among these passwords. 35.4% of the passwords fall in the medium category, with relative strength between 6 and 8, whereas 9.2% have a weak strength of 4-6. Very weak passwords, with strength 0-4, constitute around 8.8% of the passwords. Around 3% of the top 500 common passwords have a strength above 10, which is an interesting outlier because it falls well outside the 1-10 limits of the strength scale set in the dataset description. Since these vary greatly from typical password strength, the passwords with strength greater than 10 are not included in the data analysis.
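
The banding behind these percentages can be sketched as below. This is an illustrative Python version, not the code used to produce Figure 3.5. The text does not say which band a boundary value such as 8 belongs to, so treating each upper bound as inclusive is an assumption, as are the function names and the made-up scores.

```python
from collections import Counter


def strength_band(strength):
    """Assign a strength score to a band (upper bounds inclusive: an assumption)."""
    if strength > 10:
        return "above scale"
    if strength > 8:
        return "high"
    if strength > 6:
        return "medium"
    if strength > 4:
        return "weak"
    return "very weak"


def band_shares(strengths):
    """Percentage of scores falling in each band, rounded to one decimal."""
    counts = Counter(strength_band(s) for s in strengths)
    total = len(strengths)
    return {band: round(100 * n / total, 1) for band, n in counts.items()}


# Made-up scores for illustration only; these are not the report's data.
scores = [2, 4, 5, 7, 7, 8, 9, 10, 10, 25]
shares = band_shares(scores)
print(shares)
```

Keeping "above scale" as its own band makes it straightforward to exclude those passwords afterwards, as the analysis does.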

Another important characteristic by which to judge a password is the time taken to crack it. To analyze this for the most popular passwords, the top 10 passwords by rank are considered. As seen in Figure 3.6, passwords like ‘1234’, ‘12345’, ‘123456’ and ‘12345678’ are cracked very easily, taking approximately 0 seconds. Given the argument that popular passwords are predictable and therefore easily cracked, it is interesting that ‘password’, despite being ranked first in popularity, sits among passwords like ‘football’ and ‘baseball’ that take relatively more time to crack.

Figure 3.6: As expected, 1234 is quick to be cracked

The passwords in the dataset belong to 10 different categories, such as simple-alphanumeric and animal. To find which types of passwords are the strongest, the analysis focuses on the password strength and the time taken to crack them. To show the distribution of password strength within each category, a density plot is drawn below in Figure 3.7 using the ggridges (2020) package. A median line is drawn to allow comparison of strength across the categories. It can be seen from the plot that password types such as names, sport, cool-macho and nerdy-pop are much higher in strength than the other categories, as 50% of these passwords have strength higher than 8.
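
The median comparison described above amounts to grouping the strength scores by category and taking the median of each group. A minimal Python sketch follows. It is not the ggridges code behind Figure 3.7, and the category labels and scores are placeholders made up for illustration.

```python
from collections import defaultdict
from statistics import median


def median_strength_by_category(records):
    """Median strength for each category, from (category, strength) pairs."""
    groups = defaultdict(list)
    for category, strength in records:
        groups[category].append(strength)
    return {category: median(values) for category, values in groups.items()}


# Placeholder categories and made-up scores, not the report's data.
records = [
    ("category-a", 7), ("category-a", 9), ("category-a", 10),
    ("category-b", 2), ("category-b", 4), ("category-b", 8),
    ("category-c", 8), ("category-c", 8), ("category-c", 9), ("category-c", 10),
]
medians = median_strength_by_category(records)

# A median above 8 means at least half of the category scores above 8.
above_eight = sorted(c for c, m in medians.items() if m > 8)
print(medians, above_eight)
```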

Figure 3.7: Password categories like ‘simple-alphanumeric’ have low strength compared to other categories

To gather further evidence on which types of passwords are the strongest, the time taken to crack the passwords is also evaluated. To do so, the mean of the variable ‘offline_crack_sec’ is plotted for each category in Figure 3.8 below. The figure shows an interesting pattern: ‘rebellious-rude’ passwords take the longest time to be hacked on average, even though the median strength of this category is lower than that of types like ‘nerdy-pop’ and ‘sport’. A similar pattern is seen in the ‘password-related’ type. Conversely, the ‘fluffy’ type has high strength, yet its average time to be hacked is quite low.
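
The comparison above sets a per-category mean of crack time against a per-category median of strength. The Python sketch below shows how the two summaries can rank categories differently. It is illustrative only: the categories and values are made up, and the report does not investigate why the rankings differ in the real data.

```python
from collections import defaultdict
from statistics import mean, median


def summarise(records):
    """Per category: median strength and mean offline crack time."""
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record)
    return {
        category: {
            "median_strength": median(r["strength"] for r in rows),
            "mean_offline_crack_sec": mean(r["offline_crack_sec"] for r in rows),
        }
        for category, rows in groups.items()
    }


# Placeholder categories and made-up values, not the report's data.
records = [
    {"category": "category-a", "strength": 6, "offline_crack_sec": 1.0},
    {"category": "category-a", "strength": 7, "offline_crack_sec": 2.0},
    {"category": "category-a", "strength": 8, "offline_crack_sec": 30.0},
    {"category": "category-b", "strength": 8, "offline_crack_sec": 1.0},
    {"category": "category-b", "strength": 9, "offline_crack_sec": 2.0},
    {"category": "category-b", "strength": 9, "offline_crack_sec": 3.0},
]
summary = summarise(records)
print(summary)
```

In this made-up example, a single long crack time lifts the mean of category-a above that of category-b even though its median strength is lower.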

Figure 3.8: Password categories like ‘simple-alphanumeric’, ‘fluffy’ and ‘food’ are among the weaker categories

To answer the question of which password type is the strongest among these passwords, it can be said that the ‘rebellious-rude’ and ‘cool-macho’ types are good contenders.

3.4 Is there a relationship between the online and offline password crack times?

4 Conclusion

Through the exploratory data analysis of the dataset on the top 500 commonly used passwords, it was observed that most people tend to choose passwords that can be easily remembered. A typical choice is therefore a simple password that is related to a name or contains alphanumeric characters and is roughly 6-7 characters long. On further exploration, it was found that 43.6% of the commonly used passwords are relatively high in strength and that around 3% of the passwords were of very high strength, which varied greatly from typical passwords.

Furthermore, it was observed that among the password categories, the ‘rebellious-rude’ and ‘cool-macho’ types are considered strong and take relatively more time to be hacked. Another striking discovery is that hacking time and password strength in the dataset do not follow any strict relationship. Not all passwords with high strength take long to crack, and not all passwords with low strength are cracked easily: there were instances of a high strength password being hacked more quickly than a low strength password.

It can be concluded that most people choose common passwords that can be easily hacked and that using any of the passwords in the dataset is not recommended.

References

Aden-Buie, Garrick. 2020. Ggpomological: Pomological Plot Themes for Ggplot2. https://github.com/gadenbuie/ggpomological.

Cheng, Joe. 2020. Crosstalk: Inter-Widget Interactivity for Html Widgets. https://CRAN.R-project.org/package=crosstalk.

Cheng, Joe, Carson Sievert, Winston Chang, Yihui Xie, and Jeff Allen. 2020. Htmltools: Tools for Html. https://CRAN.R-project.org/package=htmltools.

“Elegant Visualization of Density Distribution in R Using Ridgeline - Datanovia.” 2020. https://www.datanovia.com/en/blog/elegant-visualization-of-density-distribution-in-r-using-ridgeline/.

Fellows, Ian. 2018. Wordcloud: Word Clouds. https://CRAN.R-project.org/package=wordcloud.

“Knowledge Is Beautiful, My New Book - Information Is Beautiful.” McCandless, D. 2020. http://www.informationisbeautiful.net/2014/knowledge-is-beautiful/.

“Password Analyser - Digininja.” Wood, R. 2020. https://digi.ninja/projects/pipal.php.

“Passwords - Skullsecurity.” 2020. https://wiki.skullsecurity.org/Passwords.

“Pie Charts.” 2020. https://plotly.com/r/pie-charts/.

“Rfordatascience/Tidytuesday.” Mock, T. 2020. https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-14.

Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.

Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” JOSS 2 (16): 355. https://doi.org/10.21105/joss.00355.

Tierney, Nicholas, Di Cook, Miles McBain, and Colin Fay. 2020. Naniar: Data Structures, Summaries, and Visualisations for Missing Data. https://CRAN.R-project.org/package=naniar.

Wickham, Hadley. 2016a. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2016b. Ggplot2: Elegant Graphics for Data Analysis. springer.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.

Wickham, Hadley, and Dana Seidel. 2020. Scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.

Wilke, Claus O. 2020. Ggridges: Ridgeline Plots in ’Ggplot2’. https://CRAN.R-project.org/package=ggridges.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2020. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.